Linear regression

In this tutorial we’ll learn how to prepare and merge data, fit linear models with lm(), and use the caret package to split data into train and test sets, preprocess predictors, and select models.

Data preparation

We’re working with the Capital Bikeshare again this week, so start by reading in usage, weather, stations.

library(dplyr)
library(ggplot2)
library(lubridate)

usage = read.delim('usage_2012.tsv',
                   sep = '\t',
                   header = TRUE)

weather = read.delim('daily_weather.tsv',
                     sep = '\t',
                     header = TRUE)

stations = read.delim('stations.tsv',
                      sep = '\t',
                      header = TRUE)

Merging data

We have three related datasets to work with, but we can’t really get started until they’re combined. Let’s start with usage and weather. The usage dataframe is at the resolution of the hour, while the weather data are at the resolution of a day, so we know we’re going to have to either duplicate or compress data to merge. I vote compress: let’s summarize!

head(usage)
##   bike_id          time_start            time_end duration_mins
## 1  W01412 2012-01-01 00:04:00 2012-01-01 00:11:00             7
## 2  W00524 2012-01-01 00:10:00 2012-01-01 00:29:00            19
## 3  W00235 2012-01-01 00:10:00 2012-01-01 00:29:00            19
## 4  W00864 2012-01-01 00:15:00 2012-01-01 00:23:00             8
## 5  W00995 2012-01-01 00:15:00 2012-01-01 00:23:00             8
## 6  W00466 2012-01-01 00:17:00 2012-01-01 00:23:00             6
##                          station_start             station_end  cust_type
## 1         7th & R St NW / Shaw Library           7th & T St NW Registered
## 2       Georgia & New Hampshire Ave NW    16th & Harvard St NW     Casual
## 3       Georgia & New Hampshire Ave NW    16th & Harvard St NW Registered
## 4                       14th & V St NW Park Rd & Holmead Pl NW Registered
## 5                  11th & Kenyon St NW           7th & T St NW Registered
## 6 Court House Metro / 15th & N Uhle St    Lynn & 19th St North Registered
custs_per_day = 
  usage %>% 
    group_by(time_start = as.Date(time_start), station_start, cust_type) %>% 
    summarize(no_rentals = n(),
              duration_mins = mean(duration_mins, na.rm = TRUE))

head(custs_per_day)
## Source: local data frame [6 x 5]
## Groups: time_start, station_start
## 
##   time_start                 station_start  cust_type no_rentals
## 1 2012-01-01           10th & Monroe St NE Registered         10
## 2 2012-01-01                10th & U St NW     Casual          8
## 3 2012-01-01                10th & U St NW Registered         50
## 4 2012-01-01 10th St & Constitution Ave NW     Casual         34
## 5 2012-01-01 10th St & Constitution Ave NW Registered         20
## 6 2012-01-01                11th & H St NE     Casual          4
## Variables not shown: duration_mins (dbl)

Perfection, now we can merge! What’s the key?

# make sure we have consistent date formats
custs_per_day$time_start = ymd(custs_per_day$time_start)
weather$date = ymd(weather$date)

# then merge. see ?merge for more details about the function
weather_rentals = merge(custs_per_day, weather, 
                        by.x = 'time_start', by.y = 'date')

# check dimensions after to make sure they are what you expect
dim(custs_per_day)
## [1] 57376     3
dim(weather)
## [1] 366  15
dim(weather_rentals)
## [1] 57376    17
head(weather_rentals)
##   time_start                 station_start no_rentals weekday season_code
## 1 2012-01-01           10th & Monroe St NE         10       0           1
## 2 2012-01-01                10th & U St NW         58       0           1
## 3 2012-01-01 10th St & Constitution Ave NW         54       0           1
## 4 2012-01-01                11th & H St NE         20       0           1
## 5 2012-01-01           11th & Kenyon St NW         58       0           1
## 6 2012-01-01           12th & Army Navy Dr         12       0           1
##   season_desc is_holiday is_work_day weather_code
## 1      Spring          0           0            1
## 2      Spring          0           0            1
## 3      Spring          0           0            1
## 4      Spring          0           0            1
## 5      Spring          0           0            1
## 6      Spring          0           0            1
##                                      weather_desc temp subjective_temp
## 1 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
## 2 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
## 3 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
## 4 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
## 5 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
## 6 Clear, Few clouds, Partly cloudy, Partly cloudy 0.37        0.375621
##   humidity windspeed no_casual_riders no_reg_riders total_riders
## 1   0.6925  0.192167              686          1608         2294
## 2   0.6925  0.192167              686          1608         2294
## 3   0.6925  0.192167              686          1608         2294
## 4   0.6925  0.192167              686          1608         2294
## 5   0.6925  0.192167              686          1608         2294
## 6   0.6925  0.192167              686          1608         2294

Great, now we want to merge in the last dataset, stations. What key links weather_rentals with stations?

final_data = merge(weather_rentals, stations, 
                   by.x = 'station_start', by.y = 'station')
dim(final_data)
## [1] 57011   152
dim(weather_rentals)
## [1] 57376    17
head(final_data[, 1:30])
##    station_start time_start no_rentals weekday season_code season_desc
## 1 10th & E St NW 2012-12-03         40       1           4      Winter
## 2 10th & E St NW 2012-09-25         57       2           4      Winter
## 3 10th & E St NW 2012-11-10         49       6           4      Winter
## 4 10th & E St NW 2012-11-12         29       1           4      Winter
## 5 10th & E St NW 2012-12-05         39       3           4      Winter
## 6 10th & E St NW 2012-11-30         31       5           4      Winter
##   is_holiday is_work_day weather_code
## 1          0           1            1
## 2          0           1            1
## 3          0           0            1
## 4          1           0            1
## 5          0           1            1
## 6          0           1            1
##                                      weather_desc     temp subjective_temp
## 1 Clear, Few clouds, Partly cloudy, Partly cloudy 0.452500        0.455796
## 2 Clear, Few clouds, Partly cloudy, Partly cloudy 0.550000        0.544179
## 3 Clear, Few clouds, Partly cloudy, Partly cloudy 0.389167        0.393937
## 4 Clear, Few clouds, Partly cloudy, Partly cloudy 0.485000        0.475383
## 5 Clear, Few clouds, Partly cloudy, Partly cloudy 0.438333        0.428012
## 6 Clear, Few clouds, Partly cloudy, Partly cloudy 0.298333        0.323867
##   humidity windspeed no_casual_riders no_reg_riders total_riders  id
## 1 0.767500 0.0827208              555          5679         6234 199
## 2 0.570000 0.2363210              845          6693         7538 199
## 3 0.645417 0.0578458             2090          4446         6536 199
## 4 0.741667 0.1735170             1097          5172         6269 199
## 5 0.485000 0.3240210              331          5398         5729 199
## 6 0.649583 0.0584708              362          5306         5668 199
##   terminal_name      lat      long no_bikes no_empty_docks fast_food
## 1         31256 38.89591 -77.02606        6              8         5
## 2         31256 38.89591 -77.02606        6              8         5
## 3         31256 38.89591 -77.02606        6              8         5
## 4         31256 38.89591 -77.02606        6              8         5
## 5         31256 38.89591 -77.02606        6              8         5
## 6         31256 38.89591 -77.02606        6              8         5
##   parking restaurant convenience post_office bicycle_parking
## 1       2         16           0           1               4
## 2       2         16           0           1               4
## 3       2         16           0           1               4
## 4       2         16           0           1               4
## 5       2         16           0           1               4
## 6       2         16           0           1               4
##   drinking_water
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0
# probably want to save this now!
write.table(final_data, 
            'bikeshare_modeling_data.tsv', 
            row.names = FALSE, sep = '\t')

# rename to something more convenient and remove from memory
data = final_data
rm(final_data)

The lm() function

The function for creating a linear model in R is lm(), and its primary arguments are formula and data. Formulas in R are a little funny: instead of an = sign, they are expressed with a ~. Let’s fit the model we saw in the lecture notes: \(rentals = \beta_0 + \beta_1*crossing\). There’s a little snag we have to take care of first. Right now we’ve got repeated measures, i.e. one measurement per station per day, so we need to aggregate again, this time over dates.

rentals_crossing = 
  data %>% 
    group_by(station_start) %>% 
    summarize(mean_rentals = mean(no_rentals),
              crossing = mean(crossing))

head(rentals_crossing)
## Source: local data frame [6 x 3]
## 
##                   station_start mean_rentals crossing
## 1                10th & E St NW     36.79070      122
## 2           10th & Monroe St NE     10.59167        1
## 3                10th & U St NW     71.24317        5
## 4 10th St & Constitution Ave NW     55.92603      116
## 5                11th & H St NE     35.18579       73
## 6           11th & Kenyon St NW     61.81694       20
# plot it
ggplot(rentals_crossing, aes(x = crossing, y = mean_rentals)) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()

model = lm(mean_rentals ~ crossing, data = rentals_crossing)

# view what is returned in the lm object
attributes(model)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
# get model output
summary(model)
## 
## Call:
## lm(formula = mean_rentals ~ crossing, data = rentals_crossing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.815 -21.176  -8.394  13.765 126.969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.53163    2.59354  10.615  < 2e-16 ***
## crossing     0.49159    0.07031   6.992 4.92e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.54 on 183 degrees of freedom
## Multiple R-squared:  0.2108, Adjusted R-squared:  0.2065 
## F-statistic: 48.88 on 1 and 183 DF,  p-value: 4.92e-11
# print model diagnostics
par(mfrow = c(2, 2))
plot(model)

The attributes() function can be called on just about any object in R, and it returns a list of all the things inside. It’s a great way to explore objects and see what values they contain that could be used in other analyses. For example, extracting the residuals via model$residuals is useful if we want to produce diagnostic plots like those above.
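As a quick illustration of pulling those pieces out by name, here’s a tiny sketch fit on R’s built-in mtcars data so it stands alone:

```r
# fit a small model on built-in data so this snippet is self-contained
fit = lm(mpg ~ wt, data = mtcars)

coef(fit)                 # same as fit$coefficients
head(fit$residuals)       # one residual per observation
head(fit$fitted.values)   # predictions on the training data
```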

When we run summary() on the lm object, we see the results. The Call section just prints back the model specification, and the Residuals section summarizes the distribution of the errors. The fun stuff is in the Coefficients section: each row contains a covariate name followed by its estimate, standard error, t-value, and p-value. Our model ends up being rentals = 27.5 + 0.49(crossings), which means that the average number of rentals at a station with no crosswalks is about 27.5, and the average increases by about 1 rental for every two additional crosswalks.
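To see what the fitted equation implies, you can plug the printed estimates in by hand. This sketch hard-codes the coefficients from the summary output above:

```r
# coefficients copied from the summary(model) output above
b0 = 27.53163   # intercept: expected rentals with zero crosswalks
b1 = 0.49159    # slope: additional rentals per crosswalk

b0 + b1 * 10    # expected mean rentals at a station with 10 crosswalks, ~32.4
```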

We can fit regressions with multiple covariates the same way:

# let's include windspeed this time
rentals_multi = 
  data %>% 
    group_by(station_start) %>% 
    summarize(mean_rentals = mean(no_rentals),
              crossing = mean(crossing),
              windspeed = mean(windspeed))

head(rentals_multi)
## Source: local data frame [6 x 4]
## 
##                   station_start mean_rentals crossing windspeed
## 1                10th & E St NW     36.79070      122 0.1733099
## 2           10th & Monroe St NE     10.59167        1 0.1885038
## 3                10th & U St NW     71.24317        5 0.1895723
## 4 10th St & Constitution Ave NW     55.92603      116 0.1891103
## 5                11th & H St NE     35.18579       73 0.1895723
## 6           11th & Kenyon St NW     61.81694       20 0.1895723
ggplot(rentals_multi, aes(x = windspeed, y = mean_rentals)) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()

model = lm(mean_rentals ~ crossing + windspeed, data = rentals_multi)
summary(model)
## 
## Call:
## lm(formula = mean_rentals ~ crossing + windspeed, data = rentals_multi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.528 -19.319  -5.397  11.865 121.820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.00979   69.62422  -5.027 1.19e-06 ***
## crossing       0.45968    0.06568   6.999 4.78e-11 ***
## windspeed   2036.29096  375.29657   5.426 1.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.48 on 182 degrees of freedom
## Multiple R-squared:  0.3207, Adjusted R-squared:  0.3132 
## F-statistic: 42.96 on 2 and 182 DF,  p-value: 5.226e-16

The model coefficients changed quite a lot when we added in wind speed. The intercept is now negative, and the wind speed coefficient is huge! When interpreting coefficients, it’s important to keep the scale in mind. Wind speed only ranges from about 0.05 to 0.44, so when you multiply 2036 by a typical value like 0.19 and add the intercept of -350, you end up with about 37 rentals, which is within the range we’d expect.
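We can check that arithmetic directly, hard-coding the estimates printed above and a typical wind speed value from the data:

```r
b0        = -350.00979   # intercept from summary(model) above
b_wind    = 2036.29096   # windspeed coefficient from summary(model) above
windspeed = 0.19         # a typical value; the variable ranges from ~0.05 to 0.44

b0 + b_wind * windspeed  # about 37 rentals: back in a sensible range
```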

Let’s try one more, this time we’ll include a factor variable.

rentals_multi = 
  data %>% 
    group_by(station_start, is_holiday) %>% 
    summarize(mean_rentals = mean(no_rentals),
              crossing = mean(crossing),
              windspeed = mean(windspeed))

head(rentals_multi)
## Source: local data frame [6 x 5]
## Groups: station_start
## 
##         station_start is_holiday mean_rentals crossing windspeed
## 1      10th & E St NW          0    36.994012      122 0.1740921
## 2      10th & E St NW          1    30.000000      122 0.1471828
## 3 10th & Monroe St NE          0    10.681948        1 0.1883067
## 4 10th & Monroe St NE          1     7.727273        1 0.1947563
## 5      10th & U St NW          0    71.684507        5 0.1894117
## 6      10th & U St NW          1    57.000000        5 0.1947563
# plot crossings, colored by is_holiday
ggplot(rentals_multi, 
       aes(x = crossing, y = mean_rentals, color = factor(is_holiday))) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()

# plot windspeed, colored by is_holiday
ggplot(rentals_multi, 
       aes(x = windspeed, y = mean_rentals, color = factor(is_holiday))) +
  geom_smooth(method = 'lm', size = 2) +
  geom_point(size = 4, alpha = 0.60) +
  theme_minimal()

model = lm(mean_rentals ~ crossing + windspeed + factor(is_holiday), 
           data = rentals_multi)
summary(model)
## 
## Call:
## lm(formula = mean_rentals ~ crossing + windspeed + factor(is_holiday), 
##     data = rentals_multi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.310 -19.253  -6.486  12.377 131.761 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -51.54159   18.54479  -2.779  0.00573 ** 
## crossing              0.40017    0.04537   8.819  < 2e-16 ***
## windspeed           436.86213   99.43162   4.394 1.46e-05 ***
## factor(is_holiday)1  -7.52366    2.80516  -2.682  0.00765 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.9 on 366 degrees of freedom
## Multiple R-squared:  0.2335, Adjusted R-squared:  0.2272 
## F-statistic: 37.17 on 3 and 366 DF,  p-value: < 2.2e-16

The output looks a little funny now. There’s a term called factor(is_holiday)1. What does that mean? Factors are categorical variables, and their interpretation is relative to a baseline. Our factor is_holiday only has two levels, 0 and 1, and R sets 0 as the baseline by default. So the interpretation of that term is that we can expect about 7.5 fewer rentals when it is a holiday (i.e. is_holiday == 1) and the other variables are held fixed.
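If you’d rather have the coefficients expressed relative to a different baseline, relevel() changes which level comes first. A minimal sketch:

```r
# by default, factor levels are sorted, so 0 is the baseline
holiday = factor(c(0, 1, 0, 1, 0))
levels(holiday)                        # "0" "1"

# make 1 the baseline instead
holiday = relevel(holiday, ref = '1')
levels(holiday)                        # "1" "0"
```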

The caret package

For this section, we’ll use the fully cleaned and combined data from the project-1-data-cleanup file, so make sure you’ve gone through and cleaned your data up like that, or download the clean file from here.

data = read.delim('final_modeling_data.tsv', sep = '\t', header = TRUE)

We’ll be using the caret package (short for classification and regression training) for model development because it integrates many modeling packages in R into one unified syntax. That means more reusable code for us! caret contains helper functions that provide a unified framework for data cleaning/splitting, model training, and comparison. I highly recommend the optional reading this week which provides a great overview of the caret package.

install.packages('caret', dependencies = TRUE)
library(caret)

set.seed(1234) # set a seed

Setting a seed in R ensures that you get identical results each time you run your code. Since resampling methods are inherently probabilistic, every time we rerun them we’ll get slightly different answers. Setting the seed to the same number ensures that we get identical randomness on every run, which is helpful for debugging.
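A quick demonstration: drawing twice from the same seed gives identical results.

```r
set.seed(1234)
a = sample(1:100, 3)

set.seed(1234)
b = sample(1:100, 3)

identical(a, b)   # TRUE: same seed, same "random" draws
```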

Train and test data

Before any analysis in this class we’ll need to divide our data into train and test sets. Check out this nice overview for more details. The training set is typically about 75% of the data and is used for all the model development. Once we have a model we’re satisfied with, we use the testing set, the other 25%, to generate model predictions. Splitting the data into these two groups gives us two types of errors: in-sample and out-of-sample. In-sample errors are measured on the same data the model was built with. Out-of-sample errors are measured on a fresh dataset. We are interested in the out-of-sample error because it represents how we’d expect the model to perform in the future on brand-new data.

Here’s how to split the data with caret:

# select the training observations
in_train = createDataPartition(y = data$rentals,
                               p = 0.75, # 75% in train, 25% in test
                               list = FALSE)
head(in_train) # row indices of observations in the training set
##      Resample1
## [1,]         7
## [2,]        12
## [3,]        25
## [4,]        72
## [5,]        92
## [6,]        93
train = data[in_train, ]
test = data[-in_train, ]

dim(train)
## [1] 17544   119
dim(test)
## [1] 5846  119

Note: I recommend doing all data processing and aggregation steps before splitting out your train/test sets.

Training

Our workhorse function in the caret package is the train() function. It can be used to evaluate performance parameters, choose optimal models based on the values of those parameters, and estimate model performance. For regression we can use it in place of lm(). Here’s our last regression model refit with train():

model_fit = train(rentals ~ crossing + windspeed + factor(is_holiday), 
                  data = train, 
                  method = 'lm',
                  metric = 'RMSE') 
print(model_fit)
## Linear Regression 
## 
## 17544 samples
##   118 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 17544, 17544, 17544, 17544, 17544, 17544, ... 
## 
## Resampling results
## 
##   RMSE      Rsquared    RMSE SD    Rsquared SD
##   26.67981  0.06043381  0.3414677  0.004150964
## 
## 
summary(model_fit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.402 -14.852 -10.088   6.122 235.311 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           16.909526   0.725825  23.297  < 2e-16 ***
## crossing               0.213221   0.006357  33.543  < 2e-16 ***
## windspeed             -0.901991   3.628303  -0.249    0.804    
## `factor(is_holiday)1` -4.664848   0.681877  -6.841 8.11e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.71 on 17540 degrees of freedom
## Multiple R-squared:  0.0622, Adjusted R-squared:  0.06204 
## F-statistic: 387.8 on 3 and 17540 DF,  p-value: < 2.2e-16
# get predictions
out_of_sample_predictions = predict(model_fit, newdata = test)

# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
                    observed = test$rentals,
                    error = out_of_sample_predictions - test$rentals)

# eh, not so good
ggplot(data = errors, aes(x = predicted, y = observed)) + 
  geom_abline(aes(intercept = 0, slope = 1), 
              size = 3, alpha = 0.70, color = 'red') +
  geom_point(size = 3, alpha = 0.80) +
  ggtitle('out-of-sample errors') +
  theme_minimal()
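The RMSE that train() reports is just the square root of the mean squared prediction error, so with a set of predictions and observations in hand you can compute it yourself. A self-contained sketch with made-up numbers:

```r
# made-up predictions and observations, just to illustrate the formula
predicted = c(20, 35, 50)
observed  = c(25, 30, 60)

rmse = sqrt(mean((predicted - observed)^2))
rmse   # sqrt(50), about 7.07
```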

Our prediction accuracy is not so great for this model. The in-sample RMSE is about 27, which means that on average the predictions are off by about 27 rentals. Let’s fit the giant model we made before:

full_model = train(rentals ~ ., 
                  data = train, 
                  method = 'lm') 

The in-sample RMSE is about 19, so definitely an improvement over the previous model, but this model is really complex and probably not going to be usable by Pronto. How can we reduce the complexity of the model, but maintain reasonable predictive accuracy?

Preprocessing

Shrinkage methods require that the predictors are normalized to be on the same scale. We can accomplish this by centering and scaling the data. You center a variable by subtracting its mean from each observation; you then scale by dividing each centered observation by the variable’s standard deviation. The transformed variable has mean 0 and standard deviation 1.
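Here’s a minimal sketch of the two steps, plus R’s built-in scale() which does both at once:

```r
x = c(2, 4, 6, 8)

z = (x - mean(x)) / sd(x)   # center, then scale

mean(z)   # 0
sd(z)     # 1 (to floating point)

# scale() performs the same transformation
all.equal(as.vector(scale(x)), z)   # TRUE
```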

The caret package has lots of convenient functions for preprocessing data, check ’em out!

Converting factors to dummy variables

We run into some trouble if we try to just center and scale the data, because it’s got factor variables, and you can’t subtract a number from a category. We can use the model.matrix() function to fix that quickly.

no_factors = as.data.frame(model.matrix(rentals ~ . -1, data = data))

# put rentals back on
no_factors$rentals = data$rentals

full_model_scaled = train(rentals ~ ., 
                  data = no_factors, 
                  method = 'lm',
                  preProcess = c('center', 'scale'))

Coefficients estimated from normalized data have a different interpretation than coefficients from un-normalized data. When the data are scaled, the intercept has a nicer interpretation: it’s the expected number of rentals when all the predictors are at their average values. So, in this case, when all the predictors are at their average values, we expect about 21 rentals per day. In the previous full model we had an intercept of about -28, which would be interpreted as the expected number of rentals when all the other predictors have a value of 0. That’s pretty unsatisfying for a couple of reasons. First, we can’t have negative rentals! Second, for a lot of the predictors it doesn’t make sense to plug in 0s. What does it mean to have a duration of 0? Or a temp of 0? Centering and scaling fix the non-interpretability of the previous models.

Since we divide by the standard deviation during scaling, the non-intercept coefficients in the centered and scaled model can be interpreted as the increase in \(y\) associated with a 1 standard deviation increase in \(x\).
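You can verify that relationship on any dataset: the standardized slope equals the raw slope times the predictor’s standard deviation. A sketch on R’s built-in mtcars:

```r
# the same model, fit on the raw and the standardized predictor
fit_raw    = lm(mpg ~ wt, data = mtcars)
fit_scaled = lm(mpg ~ scale(wt), data = mtcars)

coef(fit_raw)[['wt']] * sd(mtcars$wt)   # raw slope times sd(x)...
coef(fit_scaled)[[2]]                   # ...equals the standardized slope
```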

Model Selection

Variable combination

A simple method to reduce model complexity is to combine some of the variables. For example, the dataset contains separate variables for alcohol, pub, and bar; likewise, there are variables for food_court, restaurant, food_cart, and fast_food. Maybe we can retain the information while removing some variables.

no_factors$food = no_factors$fast_food + no_factors$restaurant + 
  no_factors$food_court + no_factors$bar.restaurant + 
  no_factors$cafe + no_factors$food_cart 

no_factors$nightlife = no_factors$bar + no_factors$club + 
  no_factors$pub + no_factors$nightclub 

no_factors$seedy_stuff = no_factors$stripclub + no_factors$strip_club + 
  no_factors$alcohol + no_factors$check_cashing + no_factors$motel + 
  no_factors$hostel

no_factors$tourism = no_factors$theatre + no_factors$arts_centre + 
  no_factors$tourist + no_factors$school..historic. + no_factors$hotel + 
  no_factors$gallery + no_factors$artwork + no_factors$sculpture + 
  no_factors$museum + no_factors$tour_guide + no_factors$car_rental + 
  no_factors$guest_house + no_factors$landmark + no_factors$attraction + 
  no_factors$information

dim(no_factors)
## [1] 23390   133
# now remove those variables from the no_factors dataset
no_factors = 
  no_factors %>%
  select(-fast_food, -restaurant, -food_court, -bar.restaurant, -cafe, 
         -food_cart, -bar, -club, -pub, -nightclub, -stripclub, -strip_club, 
         -alcohol, -check_cashing, -motel, -hostel, -theatre, -arts_centre, 
         -tourist, -school..historic., -hotel, -gallery, -artwork, -sculpture, 
         -museum, -tour_guide, -car_rental, -guest_house, -landmark, 
         -attraction, -information)

# Reduced the dataset by 31 variables!
dim(no_factors)
## [1] 23390   102

Try out your own categories, these are just a few to get you started. We’ll learn how to make categories computationally when we cover clustering.

We’ve changed the dataframe, so don’t forget to redefine the train and test sets!

train = no_factors[in_train, ]
test = no_factors[-in_train, ]

dim(train)
## [1] 17544   102
dim(test)
## [1] 5846  102
# how does our new full-model compare?
full_model = train(rentals ~ ., 
                  data = train, 
                  method = 'lm') 

Subset selection

We haven’t talked much about computational limitations yet, but now is a good time to start. Selection methods can be extremely slow. Why? Because with \(p\) predictors there are \(2^p\) possible variable combinations to consider; here that’s \(2^{117}\). I recommend doing some combining before trying these methods. I’ll leave the combining up to you, but to make sure these models can run in less than infinite time, I’m going to remove a bunch of predictors so you get the idea.
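To get a feel for why exhaustive search is hopeless, just count the subsets; each added predictor doubles the number of candidate models:

```r
2^10    # 1,024 models with 10 predictors
2^20    # about a million with 20
2^117   # about 1.7e35: far beyond anything we could enumerate
```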

train = 
  train %>% 
    select(rentals, cust_typeCasual, cust_typeRegistered, cust_typeSubscriber, 
           weekday1, weekday2, weekday3, weekday4, weekday5, weekday6, 
           season_code2, season_code3, season_code4, is_holiday, weather_code2, 
           weather_code3, humidity, windspeed, temp, duration, food, nightlife, 
           seedy_stuff, tourism) 

test = 
  test %>% 
    select(rentals, cust_typeCasual, cust_typeRegistered, cust_typeSubscriber, 
           weekday1, weekday2, weekday3, weekday4, weekday5, weekday6, 
           season_code2, season_code3, season_code4, is_holiday, weather_code2, 
           weather_code3, humidity, windspeed, temp, duration, food, nightlife, 
           seedy_stuff, tourism)

# forward selection
forward_model = train(rentals ~ ., 
                      data = na.omit(train),  
                      method = 'leapForward',
                      preProcess = c('center', 'scale'),
                      # try models of size 1 - 23
                      tuneGrid = expand.grid(nvmax = 1:23),
                      trControl = trainControl(method = 'cv', number = 5)) 
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
# what does this return?
attributes(forward_model)
## $names
##  [1] "method"       "modelInfo"    "modelType"    "results"     
##  [5] "pred"         "bestTune"     "call"         "dots"        
##  [9] "metric"       "control"      "finalModel"   "preProcess"  
## [13] "trainingData" "resample"     "resampledCM"  "perfNames"   
## [17] "maximize"     "yLimits"      "times"        "terms"       
## [21] "coefnames"    "xlevels"     
## 
## $class
## [1] "train"         "train.formula"
# what should the number of variables, nvmax, be?
forward_model$bestTune
##    nvmax
## 22    22
# what metric was used?
forward_model$metric
## [1] "RMSE"
# here's a handful of other useful plots and summaries
print(forward_model)
## Linear Regression with Forward Selection 
## 
## 17544 samples
##    23 predictor
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 14036, 14037, 14034, 14034, 14035 
## 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   RMSE SD    Rsquared SD
##    1     25.34630  0.1558734  0.5171403  0.01993912 
##    2     24.93580  0.1829453  0.5922634  0.02108161 
##    3     24.90329  0.1850447  0.6108495  0.02084754 
##    4     24.89912  0.1852932  0.6077094  0.02095107 
##    5     24.86277  0.1876819  0.6073913  0.02100862 
##    6     24.85255  0.1883723  0.6193269  0.02130686 
##    7     24.22592  0.2285633  0.5668672  0.02162475 
##    8     24.22682  0.2285179  0.5676854  0.02186734 
##    9     24.35505  0.2213029  0.7227301  0.02442361 
##   10     24.36359  0.2210779  0.7354085  0.02497915 
##   11     23.70296  0.2626084  1.2645873  0.06460206 
##   12     23.51564  0.2744719  1.0021704  0.05322519 
##   13     23.25240  0.2905890  0.7347665  0.02825663 
##   14     23.18618  0.2942423  0.6875367  0.02641883 
##   15     23.15332  0.2962349  0.6786997  0.02776622 
##   16     23.10479  0.2990525  0.6630661  0.02765034 
##   17     23.09873  0.2995969  0.6716214  0.02702222 
##   18     23.09999  0.2995307  0.6732473  0.02707303 
##   19     23.09414  0.2998882  0.6786158  0.02728513 
##   20     23.09389  0.2999094  0.6844024  0.02764143 
##   21     23.07183  0.3012558  0.6766023  0.02730813 
##   22     23.03024  0.3036193  0.6746600  0.02760871 
##   23     23.03024  0.3036193  0.6746600  0.02760871 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was nvmax = 22.
summary(forward_model)
## Subset selection object
## 23 Variables  (and intercept)
##                     Forced in Forced out
## cust_typeCasual         FALSE      FALSE
## cust_typeRegistered     FALSE      FALSE
## weekday1                FALSE      FALSE
## weekday2                FALSE      FALSE
## weekday3                FALSE      FALSE
## weekday4                FALSE      FALSE
## weekday5                FALSE      FALSE
## weekday6                FALSE      FALSE
## season_code2            FALSE      FALSE
## season_code3            FALSE      FALSE
## season_code4            FALSE      FALSE
## is_holiday              FALSE      FALSE
## weather_code2           FALSE      FALSE
## weather_code3           FALSE      FALSE
## humidity                FALSE      FALSE
## windspeed               FALSE      FALSE
## temp                    FALSE      FALSE
## duration                FALSE      FALSE
## food                    FALSE      FALSE
## nightlife               FALSE      FALSE
## seedy_stuff             FALSE      FALSE
## tourism                 FALSE      FALSE
## cust_typeSubscriber     FALSE      FALSE
## 1 subsets of each size up to 22
## Selection Algorithm: forward
##           cust_typeCasual cust_typeRegistered cust_typeSubscriber weekday1
## 1  ( 1 )  " "             "*"                 " "                 " "     
## 2  ( 1 )  " "             "*"                 " "                 " "     
## 3  ( 1 )  " "             "*"                 " "                 " "     
## 4  ( 1 )  "*"             "*"                 " "                 " "     
## 5  ( 1 )  "*"             "*"                 " "                 " "     
## 6  ( 1 )  "*"             "*"                 " "                 " "     
## 7  ( 1 )  "*"             "*"                 " "                 " "     
## 8  ( 1 )  "*"             "*"                 " "                 " "     
## 9  ( 1 )  "*"             "*"                 " "                 " "     
## 10  ( 1 ) "*"             "*"                 " "                 " "     
## 11  ( 1 ) "*"             "*"                 " "                 " "     
## 12  ( 1 ) "*"             "*"                 " "                 " "     
## 13  ( 1 ) "*"             "*"                 " "                 " "     
## 14  ( 1 ) "*"             "*"                 " "                 " "     
## 15  ( 1 ) "*"             "*"                 " "                 " "     
## 16  ( 1 ) "*"             "*"                 " "                 " "     
## 17  ( 1 ) "*"             "*"                 " "                 " "     
## 18  ( 1 ) "*"             "*"                 " "                 "*"     
## 19  ( 1 ) "*"             "*"                 " "                 "*"     
## 20  ( 1 ) "*"             "*"                 " "                 "*"     
## 21  ( 1 ) "*"             "*"                 " "                 "*"     
## 22  ( 1 ) "*"             "*"                 " "                 "*"     
##           weekday2 weekday3 weekday4 weekday5 weekday6 season_code2
## 1  ( 1 )  " "      " "      " "      " "      " "      " "         
## 2  ( 1 )  " "      " "      " "      " "      " "      " "         
## 3  ( 1 )  " "      " "      " "      " "      " "      " "         
## 4  ( 1 )  " "      " "      " "      " "      " "      " "         
## 5  ( 1 )  " "      " "      " "      " "      " "      " "         
## 6  ( 1 )  " "      " "      " "      " "      " "      " "         
## 7  ( 1 )  " "      " "      " "      " "      " "      " "         
## 8  ( 1 )  " "      " "      " "      " "      " "      " "         
## 9  ( 1 )  " "      " "      " "      " "      " "      " "         
## 10  ( 1 ) " "      " "      " "      " "      " "      "*"         
## 11  ( 1 ) " "      " "      " "      " "      " "      "*"         
## 12  ( 1 ) " "      " "      " "      " "      " "      "*"         
## 13  ( 1 ) " "      " "      " "      " "      " "      "*"         
## 14  ( 1 ) " "      " "      " "      " "      " "      "*"         
## 15  ( 1 ) " "      " "      " "      "*"      " "      "*"         
## 16  ( 1 ) " "      " "      " "      "*"      "*"      "*"         
## 17  ( 1 ) " "      "*"      " "      "*"      "*"      "*"         
## 18  ( 1 ) " "      "*"      " "      "*"      "*"      "*"         
## 19  ( 1 ) " "      "*"      "*"      "*"      "*"      "*"         
## 20  ( 1 ) "*"      "*"      "*"      "*"      "*"      "*"         
## 21  ( 1 ) "*"      "*"      "*"      "*"      "*"      "*"         
## 22  ( 1 ) "*"      "*"      "*"      "*"      "*"      "*"         
##           season_code3 season_code4 is_holiday weather_code2 weather_code3
## 1  ( 1 )  " "          " "          " "        " "           " "          
## 2  ( 1 )  " "          " "          " "        " "           " "          
## 3  ( 1 )  " "          " "          " "        " "           " "          
## 4  ( 1 )  " "          " "          " "        " "           " "          
## 5  ( 1 )  " "          " "          " "        " "           "*"          
## 6  ( 1 )  "*"          " "          " "        " "           "*"          
## 7  ( 1 )  "*"          " "          " "        " "           "*"          
## 8  ( 1 )  "*"          " "          "*"        " "           "*"          
## 9  ( 1 )  "*"          " "          "*"        " "           "*"          
## 10  ( 1 ) "*"          " "          "*"        " "           "*"          
## 11  ( 1 ) "*"          "*"          "*"        " "           "*"          
## 12  ( 1 ) "*"          "*"          "*"        " "           "*"          
## 13  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 14  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 15  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 16  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 17  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 18  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 19  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 20  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 21  ( 1 ) "*"          "*"          "*"        "*"           "*"          
## 22  ( 1 ) "*"          "*"          "*"        "*"           "*"          
##           humidity windspeed temp duration food nightlife seedy_stuff
## 1  ( 1 )  " "      " "       " "  " "      " "  " "       " "        
## 2  ( 1 )  " "      " "       " "  " "      " "  " "       " "        
## 3  ( 1 )  " "      " "       " "  " "      " "  "*"       " "        
## 4  ( 1 )  " "      " "       " "  " "      " "  "*"       " "        
## 5  ( 1 )  " "      " "       " "  " "      " "  "*"       " "        
## 6  ( 1 )  " "      " "       " "  " "      " "  "*"       " "        
## 7  ( 1 )  " "      " "       " "  " "      "*"  "*"       " "        
## 8  ( 1 )  " "      " "       " "  " "      "*"  "*"       " "        
## 9  ( 1 )  " "      " "       "*"  " "      "*"  "*"       " "        
## 10  ( 1 ) " "      " "       "*"  " "      "*"  "*"       " "        
## 11  ( 1 ) " "      " "       "*"  " "      "*"  "*"       " "        
## 12  ( 1 ) " "      " "       "*"  " "      "*"  "*"       "*"        
## 13  ( 1 ) " "      " "       "*"  " "      "*"  "*"       "*"        
## 14  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 15  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 16  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 17  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 18  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 19  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 20  ( 1 ) " "      "*"       "*"  " "      "*"  "*"       "*"        
## 21  ( 1 ) "*"      "*"       "*"  " "      "*"  "*"       "*"        
## 22  ( 1 ) "*"      "*"       "*"  "*"      "*"  "*"       "*"        
##           tourism
## 1  ( 1 )  " "    
## 2  ( 1 )  "*"    
## 3  ( 1 )  "*"    
## 4  ( 1 )  "*"    
## 5  ( 1 )  "*"    
## 6  ( 1 )  "*"    
## 7  ( 1 )  "*"    
## 8  ( 1 )  "*"    
## 9  ( 1 )  "*"    
## 10  ( 1 ) "*"    
## 11  ( 1 ) "*"    
## 12  ( 1 ) "*"    
## 13  ( 1 ) "*"    
## 14  ( 1 ) "*"    
## 15  ( 1 ) "*"    
## 16  ( 1 ) "*"    
## 17  ( 1 ) "*"    
## 18  ( 1 ) "*"    
## 19  ( 1 ) "*"    
## 20  ( 1 ) "*"    
## 21  ( 1 ) "*"    
## 22  ( 1 ) "*"
plot(forward_model)

plot(varImp(forward_model))

# compare all the models
plot(forward_model$finalModel, scale = 'adjr2')

# backward selection
backward_model = train(rentals ~ ., 
                       data = na.omit(train),  
                       method = 'leapBackward',
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(nvmax = 1:23),
                       trControl = trainControl(method = 'cv', number = 5)) 
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
plot(backward_model)

plot(backward_model$finalModel, scale = 'adjr2')

plot(varImp(backward_model, scale = TRUE))

# steps in both directions
hybrid_model = train(rentals ~ ., 
                     data = na.omit(train),  
                     method = 'leapSeq',
                     preProcess = c('center', 'scale'),
                     tuneGrid = expand.grid(nvmax = 1:23),
                     trControl = trainControl(method = 'cv', number = 5)) 
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
plot(hybrid_model)

plot(hybrid_model$finalModel, scale = 'adjr2')

plot(varImp(hybrid_model))

Shrinkage

Ridge regression

# ridge regression
ridge_model = train(rentals ~ ., 
                    data = train, 
                    method = 'ridge',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    # reducing the cv for speed
                    trControl = trainControl(method = 'cv', number = 5))

print(ridge_model)
## Ridge Regression 
## 
## 17544 samples
##    23 predictor
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 14035, 14034, 14036, 14035, 14036 
## 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE      Rsquared   RMSE SD    Rsquared SD
##   0.0000000000  22.91761  0.3096930  0.1665859  0.01280914 
##   0.0001000000  22.91760  0.3096934  0.1665527  0.01281036 
##   0.0002371374  22.91760  0.3096939  0.1665074  0.01281202 
##   0.0005623413  22.91759  0.3096949  0.1664011  0.01281595 
##   0.0013335214  22.91758  0.3096965  0.1661550  0.01282512 
##   0.0031622777  22.91763  0.3096956  0.1656028  0.01284618 
##   0.0074989421  22.91809  0.3096730  0.1644485  0.01289259 
##   0.0177827941  22.92037  0.3095484  0.1623692  0.01298722 
##   0.0421696503  22.92825  0.3091161  0.1596038  0.01315668 
##   0.1000000000  22.94861  0.3081006  0.1582659  0.01340910 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was lambda = 0.001333521.
plot(ridge_model)

plot(ridge_model$finalModel)

plot(varImp(ridge_model))

# get the coefficients for the model
# NOTE: elasticnet centers the data internally, so the coefficient path
# doesn't include an intercept term
ridge_coefs = predict(ridge_model$finalModel, type = 'coef', mode = 'norm')$coefficients
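To see which predictors ridge shrinks the most, we can look at the last row of the coefficient path, which corresponds to the least-penalized solution. This is just a sketch using the `ridge_coefs` matrix extracted above:

```r
# each row of ridge_coefs is one point along the regularization path;
# the final row is the least-penalized solution
final_coefs = ridge_coefs[nrow(ridge_coefs), ]

# sort by absolute magnitude to see which predictors carry the most weight
sort(abs(final_coefs), decreasing = TRUE)
```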

# ridge regression with variable selection
ridge_model2 = train(rentals ~ ., 
                     data = train, 
                     method = 'foba',
                     preProcess = c('center', 'scale'),
                     tuneLength = 10,
                     trControl = trainControl(method = 'cv', number = 5))

print(ridge_model2)
## Ridge Regression with Variable Selection 
## 
## 17544 samples
##    23 predictor
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 14033, 14036, 14036, 14035, 14036 
## 
## Resampling results across tuning parameters:
## 
##   lambda        k   RMSE      Rsquared   RMSE SD    Rsquared SD
##   1.000000e-05   2  24.16441  0.2321921  0.2759654  0.01442891 
##   1.000000e-05   4  23.22468  0.2907670  0.3494591  0.01535095 
##   1.000000e-05   6  23.08243  0.2994412  0.3349153  0.01571250 
##   1.000000e-05   9  23.04479  0.3017099  0.3377533  0.01598847 
##   1.000000e-05  11  22.99470  0.3047225  0.3331931  0.01467418 
##   1.000000e-05  13  22.96824  0.3063260  0.3276297  0.01504069 
##   1.000000e-05  16  22.95721  0.3069987  0.3194928  0.01454891 
##   1.000000e-05  18  22.93837  0.3081493  0.3210620  0.01451841 
##   1.000000e-05  20  22.90964  0.3098953  0.3353790  0.01543189 
##   1.000000e-05  23  22.91341  0.3096724  0.3382143  0.01558103 
##   2.782559e-05   2  24.16441  0.2321921  0.2759646  0.01442891 
##   2.782559e-05   4  23.22468  0.2907670  0.3494561  0.01535091 
##   2.782559e-05   6  23.08243  0.2994412  0.3349121  0.01571245 
##   2.782559e-05   9  23.04479  0.3017098  0.3377501  0.01598844 
##   2.782559e-05  11  22.99470  0.3047225  0.3331914  0.01467417 
##   2.782559e-05  13  22.96824  0.3063258  0.3276323  0.01504087 
##   2.782559e-05  16  22.95721  0.3069986  0.3194955  0.01454911 
##   2.782559e-05  18  22.93838  0.3081492  0.3210652  0.01451862 
##   2.782559e-05  20  22.90963  0.3098954  0.3353819  0.01543203 
##   2.782559e-05  23  22.91341  0.3096724  0.3382184  0.01558111 
##   7.742637e-05   2  24.16441  0.2321921  0.2759624  0.01442890 
##   7.742637e-05   4  23.22468  0.2907670  0.3494479  0.01535079 
##   7.742637e-05   6  23.08243  0.2994412  0.3349031  0.01571230 
##   7.742637e-05   9  23.04479  0.3017098  0.3377411  0.01598834 
##   7.742637e-05  11  22.99470  0.3047224  0.3331866  0.01467412 
##   7.742637e-05  13  22.96824  0.3063255  0.3276395  0.01504137 
##   7.742637e-05  16  22.95721  0.3069984  0.3195030  0.01454966 
##   7.742637e-05  18  22.93838  0.3081488  0.3210741  0.01451923 
##   7.742637e-05  20  22.90963  0.3098955  0.3353900  0.01543244 
##   7.742637e-05  23  22.91341  0.3096726  0.3382302  0.01558152 
##   2.154435e-04   2  24.16441  0.2321921  0.2759565  0.01442889 
##   2.154435e-04   4  23.22468  0.2907669  0.3494250  0.01535045 
##   2.154435e-04   6  23.08243  0.2994411  0.3348783  0.01571190 
##   2.154435e-04   9  23.04479  0.3017096  0.3377162  0.01598808 
##   2.154435e-04  11  22.99470  0.3047223  0.3331734  0.01467399 
##   2.154435e-04  13  22.96826  0.3063245  0.3276597  0.01504276 
##   2.154435e-04  16  22.95722  0.3069976  0.3195241  0.01455118 
##   2.154435e-04  18  22.93839  0.3081479  0.3210988  0.01452090 
##   2.154435e-04  20  22.90961  0.3098958  0.3354125  0.01543355 
##   2.154435e-04  23  22.91339  0.3096732  0.3382628  0.01558268 
##   5.994843e-04   2  24.16441  0.2321921  0.2759400  0.01442885 
##   5.994843e-04   4  23.22468  0.2907668  0.3493614  0.01534951 
##   5.994843e-04   6  23.08243  0.2994410  0.3348093  0.01571076 
##   5.994843e-04   9  23.04479  0.3017091  0.3376469  0.01598736 
##   5.994843e-04  11  22.99469  0.3047221  0.3331366  0.01467363 
##   5.994843e-04  13  22.96830  0.3063217  0.3277163  0.01504665 
##   5.994843e-04  16  22.95725  0.3069955  0.3195833  0.01455544 
##   5.994843e-04  18  22.93843  0.3081453  0.3211678  0.01452558 
##   5.994843e-04  20  22.90957  0.3098975  0.3354735  0.01543562 
##   5.994843e-04  23  22.91334  0.3096756  0.3383560  0.01558474 
##   1.668101e-03   2  24.16442  0.2321921  0.2758943  0.01442873 
##   1.668101e-03   4  23.22469  0.2907662  0.3491848  0.01534691 
##   1.668101e-03   6  23.08244  0.2994405  0.3346179  0.01570762 
##   1.668101e-03   9  23.04479  0.3017077  0.3374545  0.01598535 
##   1.668101e-03  11  22.98837  0.3051108  0.3341303  0.01527012 
##   1.668101e-03  13  22.96843  0.3063137  0.3278780  0.01505759 
##   1.668101e-03  16  22.96027  0.3068181  0.3193561  0.01464686 
##   1.668101e-03  18  22.93854  0.3081377  0.3213624  0.01453863 
##   1.668101e-03  20  22.90949  0.3098997  0.3356359  0.01544221 
##   1.668101e-03  23  22.91329  0.3096767  0.3385955  0.01559395 
##   4.641589e-03   2  24.16448  0.2321921  0.2757684  0.01442841 
##   4.641589e-03   4  23.22483  0.2907642  0.3486969  0.01533966 
##   4.641589e-03   6  23.08257  0.2994387  0.3340906  0.01569890 
##   4.641589e-03   9  23.04490  0.3017030  0.3369241  0.01597991 
##   4.641589e-03  11  22.98846  0.3051095  0.3338080  0.01527004 
##   4.641589e-03  13  22.97542  0.3058984  0.3417399  0.01562629 
##   4.641589e-03  16  22.96060  0.3068010  0.3198274  0.01467872 
##   4.641589e-03  18  22.93898  0.3081144  0.3219154  0.01457510 
##   4.641589e-03  20  22.90952  0.3098919  0.3360699  0.01545568 
##   4.641589e-03  23  22.91310  0.3096869  0.3395502  0.01563514 
##   1.291550e-02   2  24.16501  0.2321921  0.2754286  0.01442752 
##   1.291550e-02   4  23.22593  0.2907544  0.3473658  0.01531957 
##   1.291550e-02   6  23.08366  0.2994297  0.3326622  0.01567485 
##   1.291550e-02   9  23.02771  0.3028091  0.3463873  0.01652423 
##   1.291550e-02  11  22.98949  0.3050983  0.3329332  0.01526908 
##   1.291550e-02  13  22.97624  0.3058990  0.3404989  0.01561102 
##   1.291550e-02  16  22.96235  0.3067443  0.3211597  0.01476606 
##   1.291550e-02  18  22.93745  0.3082666  0.3287773  0.01494385 
##   1.291550e-02  20  22.91138  0.3098055  0.3374303  0.01550510 
##   1.291550e-02  23  22.91409  0.3096441  0.3391623  0.01562220 
##   3.593814e-02   2  24.16899  0.2321920  0.2745598  0.01442513 
##   3.593814e-02   4  23.23396  0.2906973  0.3438570  0.01526427 
##   3.593814e-02   6  23.09181  0.2993769  0.3289682  0.01560958 
##   3.593814e-02   9  23.03531  0.3027417  0.3427833  0.01649648 
##   3.593814e-02  11  22.99625  0.3050558  0.3312629  0.01524488 
##   3.593814e-02  13  22.97477  0.3060977  0.3347229  0.01539609 
##   3.593814e-02  16  22.96066  0.3069590  0.3271624  0.01546492 
##   3.593814e-02  18  22.94541  0.3078998  0.3414731  0.01566311 
##   3.593814e-02  20  22.92076  0.3094149  0.3391434  0.01558767 
##   3.593814e-02  23  22.92076  0.3094149  0.3391434  0.01558767 
##   1.000000e-01   2  24.19607  0.2321918  0.2726591  0.01441895 
##   1.000000e-01   4  23.24676  0.2907311  0.3397342  0.01518991 
##   1.000000e-01   6  23.11912  0.2985650  0.3529153  0.01679539 
##   1.000000e-01   9  23.04864  0.3026729  0.3419578  0.01656237 
##   1.000000e-01  11  23.01178  0.3049311  0.3320891  0.01537461 
##   1.000000e-01  13  23.00910  0.3050803  0.3324609  0.01557251 
##   1.000000e-01  16  23.00910  0.3050803  0.3324609  0.01557251 
##   1.000000e-01  18  23.00910  0.3050803  0.3324609  0.01557251 
##   1.000000e-01  20  23.00910  0.3050803  0.3324609  0.01557251 
##   1.000000e-01  23  23.00910  0.3050803  0.3324609  0.01557251 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were k = 20 and lambda = 0.001668101.
plot(ridge_model2)

plot(varImp(ridge_model2))

Selection, ridge regression, and the lasso are just a few of the techniques at our disposal for reducing model size. See this page for a list of other available options to try out if you like.

Lasso

lasso_model = train(rentals ~ ., 
                    data = na.omit(train),
                    method = 'lasso',
                    preProc = c('scale', 'center'),
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv', number = 5))

print(lasso_model)
## The lasso 
## 
## 17544 samples
##    23 predictor
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 14035, 14036, 14036, 14035, 14034 
## 
## Resampling results across tuning parameters:
## 
##   fraction   RMSE      Rsquared   RMSE SD    Rsquared SD
##   0.1000000  25.29124  0.2329375  1.3265385  0.03973063 
##   0.1888889  24.42886  0.2699559  0.9687530  0.01940618 
##   0.2777778  23.78761  0.2882062  0.7871721  0.01335550 
##   0.3666667  23.36073  0.2938053  0.7266509  0.01269411 
##   0.4555556  23.13016  0.2999341  0.7261290  0.01280352 
##   0.5444444  23.02360  0.3038177  0.7314222  0.01334846 
##   0.6333333  22.98007  0.3058203  0.7354855  0.01351454 
##   0.7222222  22.95372  0.3073065  0.7426890  0.01381991 
##   0.8111111  22.93304  0.3085224  0.7490337  0.01421347 
##   0.9000000  22.91961  0.3093079  0.7539243  0.01452186 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was fraction = 0.9.
plot(lasso_model)

plot(varImp(lasso_model))

plot(lasso_model$finalModel)

# get the model coefficients
lasso_coefs = predict(lasso_model$finalModel, type = 'coef', mode = 'norm')$coefficients
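One advantage of the lasso over ridge is that it can shrink coefficients all the way to zero, effectively performing variable selection. As a sketch (assuming the `lasso_model` fit above), we can pull the coefficients at the tuned `fraction = 0.9` and see which predictors were dropped:

```r
# coefficients at the tuned fraction of the full L1 norm
best_coefs = predict(lasso_model$finalModel,
                     type = 'coef',
                     mode = 'fraction',
                     s = 0.9)$coefficients

# predictors the lasso removed entirely
names(best_coefs)[best_coefs == 0]
```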

Measuring predictive accuracy

All right, now we’ve got a nice collection of models. Which one should we report?

results = resamples(list(forward_selection = forward_model, 
                         backward_selection = backward_model, 
                         hybrid_selection = hybrid_model,
                         ridge_regression = ridge_model,
                         lasso_regression = lasso_model))

# compare RMSE and R-squared
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: forward_selection, backward_selection, hybrid_selection, ridge_regression, lasso_regression 
## Number of resamples: 5 
## 
## RMSE 
##                     Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## forward_selection  24.06   24.37  25.22 24.86   25.26 25.41    0
## backward_selection 24.29   24.88  24.89 24.86   25.05 25.20    0
## hybrid_selection   24.17   24.63  24.80 24.86   25.03 25.66    0
## ridge_regression   22.00   22.43  23.26 22.92   23.37 23.54    0
## lasso_regression   22.14   22.59  22.79 22.91   22.86 24.18    0
## 
## Rsquared 
##                      Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## forward_selection  0.1514  0.1920 0.1939 0.1877  0.1951 0.2059    0
## backward_selection 0.1770  0.1864 0.1891 0.1872  0.1913 0.1920    0
## hybrid_selection   0.1768  0.1868 0.1880 0.1873  0.1914 0.1934    0
## ridge_regression   0.2728  0.3120 0.3137 0.3092  0.3204 0.3270    0
## lasso_regression   0.2985  0.3003 0.3094 0.3097  0.3168 0.3236    0
# plot results
dotplot(results)

Those are in-sample statistics, however: they come from resampling the training data. To compare the models' out-of-sample prediction accuracy, we need to compute the RMSE on the test data we held out. Let's compare two models, backward selection and lasso:

backward_predictions = predict(backward_model, test)
sqrt(mean((backward_predictions - test$rentals)^2, na.rm = TRUE))
## [1] 22.3518
lasso_predictions = predict(lasso_model, test)
sqrt(mean((lasso_predictions - test$rentals)^2, na.rm = TRUE))
## [1] 22.34778
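Since we'll compute the same test-set RMSE for every candidate model, it may be worth wrapping it in a small helper. This is a sketch (`rmse_test` is just an illustrative name, and it assumes the `test` dataframe from earlier in the tutorial):

```r
# hypothetical helper: out-of-sample RMSE for any fitted caret model
rmse_test = function(model, newdata, outcome = 'rentals') {
  preds = predict(model, newdata)
  sqrt(mean((preds - newdata[[outcome]])^2, na.rm = TRUE))
}

rmse_test(backward_model, test)
rmse_test(lasso_model, test)
```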

Project tips

Check out this list of different model selection methods and try a couple out.

Once you’ve spent some time exploring candidate models, pick one and use it in your report.